In this document, we carry out a simulation exercise to evaluate the inference properties of different research designs that aim to measure the short-term effects of air pollution on health. We consider several types of quasi-experiments and associate an identification strategy with each of them.
This document focuses on running the simulations. The results are analyzed and discussed in another document.
We use data from 18 cities in France over the 2013-2018 period. The data set contains records of hospital admissions and deaths, mean concentration data for various air pollutants, a set of weather variables, and calendar control variables (such as school holidays). All variables are at the daily and city level, so there is a unique observation per date and per city in the data set.
Note that, for now, we focus on quasi-experiments for which the treatment is binary and homogeneous, both over time and across individuals. It is also not correlated with covariates (except in the case of air pollution alerts, where it is correlated with the pollutant level).
Here, we consider interventions leading to changes in air pollution levels on some random days. Examples of such interventions include transportation strikes. Of course, such dates are usually not random and are likely to be correlated with unobserved variables. In the present setting, we first consider the gold standard case in which these days are actually drawn at random. One can think of this case as a Randomized Controlled Trial (RCT): it represents what would happen if an experimenter could implement a treatment increasing pollution on random days.
Here, we consider interventions leading to changes in air pollution levels in all cities after a given date. Examples of such interventions include an air pollution reduction policy at a national level or a change in regulations at a national level leading to an increase in pollution levels.
Here, we consider interventions leading to changes in air pollution levels in a subset of cities after a given date. Examples of such interventions include an air pollution reduction policy at a sub-national level or a change in regulations at a sub-national level leading to an increase in pollution levels.
Here, we consider interventions leading to changes in air pollution levels in all cities, but at a different date in each city. Examples of such interventions include an air pollution reduction policy at a national level whose implementation is rolled out progressively across cities.
Here we consider interventions that affect exposure to air pollution when air pollution levels reach a given threshold. Examples of such interventions include air pollution alerts: when pollution reaches a certain level, alerts are released, inviting people to reduce their exposure.
We finally consider a case for which there is no quasi experiment. One can consider that in this case all units are treated. To measure the effect of air pollution on health in this case, we will use methods from the epidemiology literature, as discussed in the next section.
In order to estimate the parameters of interest, we use several identification methods. Most quasi-experiments are associated with a given identification method.
The effect of a treatment on random days is estimated using an RCT-type model. The overall idea of the RCT is to compare the average number of deaths or hospital admissions in cities with treatment to that in cities with no treatment on the same day, controlling for differences across cities.
This identification method enables us to estimate the Average Treatment Effect (ATE), i.e., the difference in means between the treated and the control group (\(\mathbb{E}[Y_{1i} - Y_{0i}]\)). To do so, we use the following type of model:
\[Y_{ct} = \alpha + \beta D_{ct} + \epsilon_{ct}\] where, as in the whole document, \(Y_{ct}\) is the health outcome of interest (for instance mortality), in a city \(c\), at date \(t\). The parameter of interest is \(\beta\) and \(D_{ct}\) is a dummy equal to 1 if the city \(c\) is treated at time \(t\) and 0 otherwise.
The identification assumption here is that the potential outcomes are independent of the treatment (independence assumption).
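As a minimal sketch, assuming a simulated data set sim_data containing the outcome resp_total and a treated dummy (names taken from the rest of this document), this specification could be estimated with fixest, whose feols function is also used in the simulations below:

library(fixest)
# Simple mean comparison between treated and untreated observations,
# with heteroskedasticity-robust standard errors
rct_est <- feols(resp_total ~ treated, data = sim_data, se = "hetero")
summary(rct_est)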
We use a Difference-In-Differences (DiD) model to estimate the effect of a sub-national treatment or a rolling intervention. The overall idea of the DiD is to compare the change in the average number of deaths or hospital admissions in treated cities before and after the intervention to the corresponding change in untreated cities.
This identification method enables us to estimate the Average Treatment Effect on the Treated (ATET), i.e., \(\mathbb{E}[Y_{1i} - Y_{0i} | Treated_c = 1, Post_t = 1]\), where \(Post_t\) is a dummy variable equal to one after the intervention and 0 otherwise. We thus have \(D_{ct} = Treated_c \times Post_t\). To estimate this effect, we use the following type of model:
\[Y_{ct} = \alpha + \beta D_{ct} + \gamma Treated_c + \delta Post_t + \epsilon_{ct}\] The key identification assumption is parallel trends, i.e., the expected difference in outcomes between the treatment and control groups would have remained the same in the absence of the treatment.
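As a sketch, assuming the dummies city_treated and post have already been added to the simulated data (as done later by the add_post_vars function) and that treated is their interaction:

# DiD specification: the coefficient on treated (= city_treated x post)
# estimates the ATET under the parallel trends assumption
did_est <- feols(
  resp_total ~ treated + city_treated + post,
  data = sim_data, se = "hetero"
)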
We use an Interrupted Time Series (ITS) type of model to estimate the effect of a national treatment. The overall idea of the ITS is to compare the average number of deaths or hospital admissions before and after the treatment, accounting for the underlying time trend.
This identification method enables us to estimate the ATET. To do so, we use the following type of model:
\[Y_{ct} = \alpha + \beta D_{ct} + \gamma t + \delta t^{post}_t + \epsilon_{ct}\] where \(t^{post}_t\) is a time variable representing the number of days after the intervention (equal to 0 before the intervention).
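As a sketch, assuming t and t_post have been constructed as described above (the add_post_vars function defined later does exactly this):

# ITS specification: t counts the days since the start of the study,
# t_post the days since the intervention (0 before it)
its_est <- feols(resp_total ~ treated + t + t_post, data = sim_data, se = "hetero")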
We use a Regression Discontinuity Design (RDD or RD) to estimate the effect of an air pollution alert type of intervention. The overall idea of the RD is to compare days just below the threshold to days just above the threshold (where exposure, and thus health impacts, should be lower). The key identification assumption is that days just below and just above the threshold are comparable. Thus, no confounders should vary discontinuously at the threshold (local independence, \((Y_{0i}, Y_{1i}) \perp D_i|Z_i\), for \(Z_i \in [c-a, c+a]\)) and the treatment should vary at the threshold (relevance, \(D_i = \mathbb{1} \{Z_i \geq c\}\)).
This identification method enables us to estimate a Local Average Treatment Effect (LATE) at the cutoff, i.e., \(\mathbb{E}[Y_{1i} - Y_{0i}|Z_i = c]\). To do so, we use the following type of model, restricting our sample to observations just below and just above the threshold:
\[Y_{ct} = \alpha + \beta D_{ct} + \epsilon_{ct}\]
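As a sketch, assuming co is the running pollutant and that threshold and bandwidth are hypothetical values chosen only for illustration, the restriction and estimation could look like the following (in the simulations we also control for the running variable, as done when the formulas are built below):

# Keep only observations within a bandwidth around the alert threshold,
# then compare days just below and just above it
rdd_sample <- dplyr::filter(sim_data, abs(co - threshold) <= bandwidth)
rdd_est <- feols(resp_total ~ treated + co, data = rdd_sample, se = "hetero")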
In all previous identification methods, we only tried to estimate a reduced form, i.e., we looked at the effect of a treatment directly on mortality. We did not consider the effect of the treatment on air pollution, i.e., the mechanism. In this section, we instrument the effect of pollution on the outcome with an instrument/treatment. We model the effect of an exogenous instrument on air pollution and use this first stage to retrieve a causal estimate of the short-term effect of air pollution on health. Note that the class of treatments/instruments considered here is broader than in the previous section.
In a first step, we only consider binary instruments, such as thermal inversions or high/low wind speed. This type of instrument corresponds to a large share of the instruments considered in the literature. The key assumption (exclusion restriction) is that these treatments only affect the health outcome variable through their effect on air pollution.
Let us denote this instrument \(Z\). We estimate a Two-Stage Least Squares (2SLS) model in which the first stage has the form:
\[Poll_{ct} = \gamma + \delta Z_{ct} + e_{ct}\] and the second stage:
\[Y_{ct} = \alpha + \beta Poll_{ct} + \epsilon_{ct}\]
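With fixest, both stages can be estimated in a single call using its IV formula syntax. A sketch mirroring the commented example further below (variable names taken from our data set):

# 2SLS: the third formula part specifies the first stage (co instrumented
# by the treated dummy); fixest names the second-stage coefficient fit_co
iv_est <- feols(
  log(resp_total) ~ temperature | city | co ~ treated,
  data = sim_data, se = "hetero"
)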
Finally, we also consider simple linear models in order to measure the correlation between the health outcome of interest and air pollution, controlling for potential confounders. The identification assumptions here are the usual OLS assumptions.
We estimate a model of the form:
\[Y_{ct} = \alpha + \beta Poll_{ct} + \epsilon_{ct}\]
To sum up, the aim of the analysis is to evaluate the inference properties of different research designs aimed at measuring the short-term effects of air pollution on health. For simplicity, consider the daily number of deaths as the outcome variable of interest for now. We proceed as follows:

- randomly select a study period and a set of cities;
- draw the treated observations according to the quasi-experiment considered;
- generate the potential outcome under treatment, Y(1);
- estimate the model associated with the identification strategy and retrieve the point estimate and p-value;
- compute the true effect of the treatment;
- replicate the simulation many times and summarize the results (power, type M and S errors, etc.).
Note that there are several parameters we can vary in order to evaluate the performance of the different methods: the identification strategy, the number of observations, the proportion of treated days, the true effect size and the model. In order to limit the number of simulations and for clarity, we only modify one parameter at a time, keeping the others constant and equal to a baseline value. When we vary a parameter, we consider the values defined below when setting up the simulations (the vect_* vectors).
We consider this set of parameters for each quasi-experiment.
In addition to the “ideal” case, in which treatment is allocated randomly, its effect is homogeneous, and compliance is perfect, we can consider deviations from this case:
| Allocation | Perfect compliance | Effect of the treatment |
|---|---|---|
| Random | Yes | Homogeneous |
| Random | No | Homogeneous |
| Correlated with covariates | Yes | Homogeneous |
| Random | Yes | Heterogeneous |
Deviating from the ideal case therefore yields a very large number of cases. We limit ourselves to a sample of these cases.
In this section, we follow the steps described in the section “analysis” and carry out our analysis. We basically define a function for each step.
We load the packages and the data and wrangle it into a format well suited for this analysis.
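A minimal sketch of this setup step, listing the packages required by the code below; the data file path is hypothetical and stands for wherever the wrangled data are stored:

# Packages used throughout the analysis
library(tidyverse) # wrangling and plotting (dplyr, tidyr, stringr, purrr, ggplot2)
library(fixest)    # fixed-effects and IV estimation (feols)
library(furrr)     # parallel mapping over the simulation parameters
library(tictoc)    # timing of the simulation runs

# Hypothetical file path: daily city-level data set used in the simulations
nmmaps_data <- readRDS("../Data/nmmaps_data.RDS")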
First, we create a function to randomly select a study period of a given length. This function also randomly selects a given number of cities to consider. For simplicity, we choose to have the same temporal study period for each city. This also seems realistic; a study focusing on several cities would probably consider a unique study period.
This function randomly selects a starting date for the study, early enough so that the study can actually last the number of days chosen. It then randomly draws a given number of cities and filters out observations outside of the study period. It returns a data set with only the desired number of cities and days.
select_study_period <- function(data, n_days = 500, n_cities = 66) {
  dates <- data[["date"]]
  # Cap the study length at the number of days actually available
  err_proof_n_days <- min(n_days, as.numeric(max(dates) - min(dates)))
  cities <- data[["city"]]
  # Cap the number of cities at the number of cities actually available
  err_proof_n_cities <- min(n_cities, length(unique(cities)))
  # Draw a starting date early enough for the study to last err_proof_n_days
  begin_study <- sample(
    seq.Date(
      min(dates),
      max(dates) - err_proof_n_days,
      "day"
    ), 1)
  dates_kept <- dplyr::between(
    dates,
    begin_study,
    begin_study + err_proof_n_days
  )
  # Draw the cities and keep only observations within the study period
  cities_kept <- cities %in% sample(levels(cities), size = err_proof_n_cities)
  data_study <- data[(dates_kept & cities_kept), ]
  return(data_study)
}
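For instance, to draw a one-year study period over 10 cities (values chosen for illustration only):

# Random 365-day study period over 10 randomly drawn cities
small_sample <- select_study_period(nmmaps_data, n_days = 365, n_cities = 10)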
Then, we create a function to draw the treatment. This function adds to the data a boolean variable, treated, stating whether each observation is in the treatment group or not.
For each treatment, we define the proportion of treated units (p_treat). The different treatments correspond to different types of quasi-experiments. Note that some identification strategies only use restricted samples (RDD and RDIT). For coding simplicity, to get an accurate proportion of treated units, we directly restrict the sample in this function. For the RDIT, we create another type of quasi-experiment ("national_short"), even though the quasi-experiment is actually the same as for a national intervention.
In this function, we made some simulation decisions:

- For the local quasi-experiment, treated cities are only treated during the second half of the study period, so to get on average a proportion p_treat of treated observations we draw 2 × p_treat cities.
- For the quasi-experiments estimated on a restricted sample (alert and national_short), the bandwidth is chosen such that p_treat observations of the total sample are treated. Observations outside the bandwidth get a NA for treated. Thus, after calling this function, we need to filter out observations with a NA for treated.
- For the rolling intervention, to get an average proportion of treated observations equal to p_treat, we do not draw this date from the whole sample. We only consider dates close enough to the end of the study period such that, on average, only p_treat of the data will be treated.
- For the national_short quasi-experiment, the sample is restricted to a window of days around the intervention date whose width depends on p_treat. Note that it does not make sense to have a large p_treat as this would correspond to the national case. We therefore limit p_treat to be smaller than 0.5.

draw_treated <- function(data, p_treat = 0.5, quasi_exp = "random_days") {
if (quasi_exp == "random_days") {
data[["treated"]] <- rbernoulli(length(data[["date"]]), p_treat)
} else if (quasi_exp == "national") {
num_date <- as.numeric(data[["date"]])
data[["treated"]] <- (num_date >= quantile(num_date, 1 - p_treat))
} else if (quasi_exp == "local") {
treated_cities <- unique(data[["city"]]) %>%
sample(size = round(length(.) * min(p_treat * 2, 1)))
data[["treated"]] <- (data[["city"]] %in% treated_cities &
data[["date"]] >= median(data[["date"]]))
} else if (str_starts(quasi_exp, "alert")) {
pollutant <- str_extract(quasi_exp, "(?<=_).+")
threshold_pos <- rbeta(1, 20, 2)
data <- data %>%
group_by(.data$city) %>%
mutate(
threshold = quantile(.data[[pollutant]], threshold_pos, names = FALSE),
bw_high = quantile(.data[[pollutant]], min(threshold_pos + p_treat, 1), names = FALSE),
treated = ifelse(
.data[[pollutant]] > bw_high | .data[[pollutant]] < threshold - (bw_high - threshold),
NA,
(.data[[pollutant]] >= threshold)
)
) %>%
select(-bw_high, -threshold) %>%
ungroup()
} else if (quasi_exp == "rolling") {
data <- data %>%
group_by(.data$city) %>%
mutate(
treated = (.data$date > sample(
seq.Date(
min(.data$date) + (1 - 2*p_treat)*length(.data$date),
max(.data$date), "day"
), 1))
) %>%
ungroup()
} else if (quasi_exp == "national_short") {
dates <- unique(data[["date"]])
bw <- floor(min(p_treat, 0.5) * length(dates))
date_start_treat <- sample(seq.Date(min(dates) + bw, max(dates) - bw, "day"), 1)
data <- data %>%
mutate(
treated = ifelse(
.data[["date"]] > date_start_treat + bw |
.data[["date"]] < date_start_treat - bw,
NA, (.data[["date"]] >= date_start_treat))
)
} else {
data[["treated"]] <- TRUE
}
return(data)
}
Both to verify that everything works well and for illustration, we can make quick plots:
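As an illustration (the original plots are not reproduced here), one could for instance plot the daily share of treated observations for a national quasi-experiment:

# Quick check: share of treated observations per day for a national treatment
nmmaps_data %>%
  select_study_period(n_days = 500, n_cities = 10) %>%
  draw_treated(p_treat = 0.2, quasi_exp = "national") %>%
  group_by(date) %>%
  summarise(share_treated = mean(treated, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = share_treated)) +
  geom_line()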
We then create the potential outcome if a unit is treated, Y(1). We use different methods to generate this fake health data, depending on the identification method considered.
We also create a short function, find_pollutant, to detect a pollutant among the independent variables of a formula; we use this function again later.
For the IV, we include the intensity of the instrument in the method name (e.g., IV_0.2). This is not very clean coding style, as we should have created a new parameter, but it makes the code much simpler and ultimately much clearer.
find_pollutant <- function(formula) {
pollutants_list <- c("co", "pm10", "pm2.5", "no")
pollutant <- pollutants_list[pollutants_list %in% all.vars(formula)]
return(pollutant)
}
create_y1 <- function(data,
                      percent_effect_size = 0.5,
                      id_method = "RCT",
                      formula) {
  fml <- Formula::as.Formula(formula)
  if (id_method %in% c("RCT", "DID", "ITS", "RDIT", "RDD")) {
    # Reduced-form designs: add to the observed outcome a Poisson-distributed
    # number of extra cases, with mean equal to percent_effect_size % of the
    # city-level average outcome
    dep_var <- all.vars(fml)[[1]]
    data <- data %>%
      group_by(.data$city) %>%
      mutate(
        y1 = .data[[dep_var]] +
          rpois(
            n(),
            mean(.data[[dep_var]], na.rm = TRUE) * percent_effect_size / 100
          ) %>% suppressWarnings() #warnings when is.na(dep_var) eg rpois(1, NA)
      ) %>%
      ungroup()
  } else if (str_starts(id_method, "OLS|IV")) {
    # OLS/IV designs: fit the model on the observed data, force the pollutant
    # coefficient to the chosen effect size (percent_effect_size/100, as the
    # outcome is in logs) and simulate y1 from the modified model with
    # resampled residual noise. For the IV, the pollutant is first shifted by
    # the instrument intensity.
    pollutant <- find_pollutant(fml)
    reg <- feols(data = data, fml = fml, combine.quick = FALSE)
    if (str_starts(id_method, "IV")) {
      iv_intensity <- as.numeric(str_extract(id_method, "(?<=_).+"))
      data[[pollutant]] <-
        data[[pollutant]] +
        iv_intensity*data[["treated"]] +
        rnorm(data[[pollutant]], 0, 0.5)
      pollutant <- str_c("fit_", pollutant)
    }
    reg$coefficients[[pollutant]] <- percent_effect_size/100
    res <- reg$residuals
    data[["y1"]] <- round(exp(predict(reg, data) +
                                rnorm(res, mean(res), sd = sd(res))))
  }
  return(data)
}
# create_y1(draw_treated(nmmaps_data), id_method = "IV_0.05", formula = "log(resp_total) ~ temperature | city | co ~ treated") %>% select(resp_total, y1)
We can then estimate our model and retrieve the point estimate and p-value. The model should be specified as a multi-part formula of the form y ~ x | fixed effects, with an additional third part for the IV first stage (endogenous pollutant ~ instrument) when relevant. If one does not want to set fixed effects, a 0 should be put in the second part of the formula: y ~ x | 0 | endogenous ~ instrument.
estimate_model <- function(data, formula) {
fml <- Formula::as.Formula(formula)
pasted_formula <- paste(str_c(fml)[[2]], str_c(fml)[[1]], str_c(fml)[[3]])
#the param of interest varies across identification methods
param_of_interest <-
ifelse(str_count(pasted_formula, "\\|") == 2, #if IV (ie 3 parts rhs in formula)
str_c("fit", find_pollutant(fml), sep = "_"),
ifelse("treated" %in% all.vars(fml), "treatedTRUE", find_pollutant(fml)))
#run the estimation
est_results <- feols(data = data, fml = fml, se = "hetero")
#retrieve the useful info
nobs <- length(est_results$residuals)
est_results %>%
broom::tidy(conf.int = TRUE) %>%
filter(term == param_of_interest) %>%
rename(p_value = p.value, se = std.error) %>%
select(estimate, p_value, se) %>%
mutate(n_obs = nobs)
}
# estimate_model(nmmaps_data, "resp_total ~ co + temperature | city")
We then want to compute simulations by running all the previous functions together. Yet, for the DiD and the ITS/RDIT, we first need to create additional dummy variables (post and city_treated for the DiD and t and t_post for the ITS/RDIT). We thus write a function which does so, add_post_vars:
add_post_vars <- function(data, id_method) {
if (id_method == "DID") {
out_data <- data %>%
group_by(city) %>%
mutate(city_treated = as.logical(max(treated))) %>%
ungroup() %>%
group_by(date) %>%
mutate(post = as.logical(max(treated))) %>%
ungroup()
} else if (id_method %in% c("ITS", "RDIT")) {
out_data <- data %>%
mutate(
date_num = as.numeric(date),
t = date_num - min(date_num),
t_post = date_num - as.numeric(min(.data$date[treated == TRUE])),
t_post = ifelse(t_post < 0, 0, t_post)
)
} else {
out_data <- data
}
return(out_data)
}
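For instance, for a DiD on a local quasi-experiment, the function adds the city_treated and post dummies to the simulated data (illustration only):

# Add the DiD dummies to a simulated data set with a local treatment
nmmaps_data %>%
  select_study_period(n_days = 500, n_cities = 10) %>%
  draw_treated(p_treat = 0.2, quasi_exp = "local") %>%
  add_post_vars(id_method = "DID") %>%
  select(city, date, treated, city_treated, post)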
We also need to compute the true effect for each simulation. The different identification methods aim to identify different estimands (ATE, ATET, etc.). We therefore compute the true effect corresponding to the estimand targeted by each identification method.
compute_true_effect <- function(data, id_method, percent_effect_size) {
effect_size <- percent_effect_size
if (id_method == "DID") {#ATET
true_effect <- data %>%
filter(post, city_treated) %>%
summarise(mean(y1 - y0, na.rm = TRUE)) %>%
as.numeric()
} else if (id_method %in% c("ITS", "RDIT")) {#ATET
true_effect <- data %>%
filter(t_post > 0) %>%
summarise(mean(y1 - y0, na.rm = TRUE)) %>%
as.numeric()
} else if (id_method == "OLS") {
true_effect <- effect_size
} else {#ATE
true_effect <- mean(data$y1 - data$y0, na.rm = TRUE)
}
return(true_effect)
}
We can now create a function running all the previous functions together and therefore performing an iteration of the simulation for a given set of parameters. This function returns a one row data set with estimate, p-value, number of observations and true effect size.
compute_simulation <- function(data,
n_days = 500,
n_cities = 66,
p_treat = 0.5,
percent_effect_size = 0.5,
quasi_exp = "random_days",
id_method = "RCT",
formula = "resp_total ~ treated") {
fml <- Formula::as.Formula(formula)
dep_var <- all.vars(fml)[1]
sim_data <- data %>%
select_study_period(n_days, n_cities) %>%
mutate(y0 = .data[[dep_var]]) %>%
draw_treated(p_treat, quasi_exp) %>%
create_y1(percent_effect_size, id_method, fml) %>%
mutate(
yobs = y1 * treated + y0 * (1 - treated)
) %>%
filter(!is.na(treated)) %>% #not strictly necessary: NA rows are dropped by the estimation anyway
add_post_vars(id_method)
updated_fml <- ifelse(
str_starts(id_method, "IV|OLS"),
formula,
str_replace(formula, ".+(?=~)", "yobs ")
)
sim_output <- sim_data %>%
estimate_model(formula = updated_fml) %>%
mutate(
true_effect = compute_true_effect(sim_data, id_method, percent_effect_size)
)
return(sim_output)
}
test <- nmmaps_data %>%
compute_simulation(
data = .,
formula = "resp_total ~ treated + temperature | city",
quasi_exp = "random_days",
id_method = "RCT",
n_days = 500
# quasi_exp = "local"
)
We will then loop this function to get a large number of replications of each simulation for a given set of parameters. We will also vary the values of the different parameters.
We can then run the simulations. But first, we define the set of parameters we want to consider.
We create a table, sim_param, in which each row contains a set of parameters for which we want to run a simulation. We will then map our function compute_simulation over this table.
To build sim_param, we first define a set of baseline values for our parameters and store them in a data frame, sim_param_base. We will then vary the values of the parameters one after the other. We thus create vectors containing the different values of the parameters we want to test.
sim_param_base <- tibble(
n_days = 500,
n_cities = 66,
p_treat = 0.2,
percent_effect_size = 0.5,
formula = "resp_total ~ treated + temperature + I(temperature^2) | city + month + year + weekday"
)
write_csv(sim_param_base, "../Outputs/sim_param_base.csv")
vect_n_days <- c(100, 250, 500, 750, 1000)
vect_n_cities <- c(66)
vect_p_treat <- c(0.01, 0.2)
vect_percent_effect_size <- c(0.1, 0.5, 1)
vect_formula <- c(
"resp_total ~ treated",
"resp_total ~ treated | city",
"resp_total ~ treated | city + month + year + weekday",
"resp_total ~ treated + temperature + I(temperature^2) | city + month + year + weekday"
)
We then want to create the actual table, varying parameters one after the other. To do so, we create a simple function add_values_param. This function adds the values of a parameter contained in a vector. We can then map this function on all the vectors of parameters of interest.
#adds all values in vect_param
add_values_param <- function(df, vect_param) {
param_name <- str_remove(vect_param, "vect_")
tib_param <- tibble(get(vect_param))
names(tib_param) <- param_name
df %>%
full_join(tib_param, by = param_name) %>%
fill(everything(), .direction = "downup")
}
vect_of_vect_param <- c("vect_n_days", "vect_n_cities", "vect_p_treat", "vect_percent_effect_size", "vect_formula")
sim_param_unique <-
map_dfr(
vect_of_vect_param,
add_values_param,
df = sim_param_base
) %>%
distinct() #bc base parameters appear twice
We want to compute our simulations for this set of parameters for every quasi experiment. Note that, in order to identify the effect of interest, we consider different types of models, depending on the identification strategy considered for each quasi experiment.
Then, for each set of parameters, we want to run many iterations of the simulation, n_iter. We thus modify sim_param so that each set of parameters appears n_iter times. This will enable us to map compute_simulation directly over sim_param.
vect_quasi_exp <- c("random_days", "national", "local", "alert_co", "national_short", "rolling", "none") #, "alert_co"
n_iter <- 1
sim_param_to_edit <- sim_param_unique %>%
crossing(vect_quasi_exp) %>%
rename(quasi_exp = vect_quasi_exp) %>%
# mutate(p_treat = ifelse(str_starts(quasi_exp "alert"), p_treat/10, p_treat) %>%
mutate(
id_method = case_when(
quasi_exp %in% c("local", "rolling") ~ "DID",
str_starts(quasi_exp, "national") ~ "ITS",
str_starts(quasi_exp, "alert") ~ "RDD",
quasi_exp == "none" ~ "OLS",
),
id_method = ifelse(
quasi_exp == "random_days",
list(c("RCT","IV_0.2")),
id_method
)
) %>%
unnest(id_method) %>%
mutate(
formula = ifelse(
str_starts(id_method, "OLS|IV"),
str_c("log(", str_extract(formula, ".+(?=\\s~)"), ") ~ ", str_extract(formula, "(?<=~\\s).+")),
formula),
formula = case_when(
id_method == "DID" ~ str_replace_all(formula, "treated",
"treated + post + city_treated"),
id_method == "ITS" ~ str_replace_all(
formula, "treated",
"treated + t + t_post"),
id_method == "RDD" ~ str_replace_all(formula, "treated", paste(
"treated +", str_remove(quasi_exp, "alert_")
)),
id_method == "OLS" ~ str_replace_all(formula, "treated", "co"),
str_starts(id_method, "IV") ~ paste(
str_remove_all(formula, "(\\+\\s)?\\btreated\\b(\\s\\+)?"),
"| co ~ treated"
),
TRUE ~ formula
)
) %>%
filter(!str_detect(formula, "~\\s{0,2}\\|"))
write_csv(sim_param_to_edit, "../Outputs/sim_param_to_edit.csv")
#Here, check everything and edit by hand
sim_param_edited <- read_csv("../Outputs/sim_param_edited.csv")
sim_param <- sim_param_edited %>%
crossing(rep_id = 1:n_iter) %>%
select(-rep_id) %>%
mutate(sim_id = row_number()) #for merging with compute_simulation() results
# write_csv(sim_param, "../Outputs/sim_param.csv")
We can then run the simulations for each set of parameters, using future_pmap_dfr (the parallel version of pmap_dfr from the furrr package).
tic()
simulations <- future_pmap_dfr(
sim_param %>% select(-sim_id),
compute_simulation,
data = nmmaps_data,
.id = "sim_id",
.options = furrr_options(seed = TRUE)
)
toc()
all_simulations <- simulations %>%
as_tibble() %>%
mutate(sim_id = as.numeric(sim_id)) %>%
left_join(sim_param, by = "sim_id") %>%
select(-sim_id)
# saveRDS(all_simulations, "../Outputs/data_simulations.RDS")
We then summarize our results, computing the statistical power, the type M error and so on for each set of parameters using the function summarise_simulations. Note that this function can only take as input a data frame produced by compute_simulation (or a mapped version of this function).
summarise_simulations <- function(data) {
data %>%
group_by(formula, quasi_exp, n_days, p_treat, percent_effect_size) %>%
summarise(
power = mean(p_value <= 0.05, na.rm = TRUE)*100,
# average_p_value = mean(p_value, na.rm = TRUE),
average_n_obs = mean(n_obs, na.rm = TRUE),
type_m = mean(ifelse(p_value <= 0.05, abs(estimate/true_effect), NA), na.rm = TRUE),
type_s = sum(ifelse(p_value <= 0.05, sign(estimate) != sign(true_effect), NA), na.rm = TRUE)/n()*100,
mse = mean((estimate - true_effect)^2, na.rm = TRUE),
normalized_bias = mean(abs((estimate - true_effect)/true_effect), na.rm = TRUE),
estimate_true_ratio = mean(abs(estimate/true_effect), na.rm = TRUE),
.groups = "drop"
) %>%
ungroup()
}
We then run this function.
summary_simulations <- summarise_simulations(all_simulations)
# saveRDS(summary_simulations, "../Outputs/summary_simulations.RDS")